Data Analysis: RecipeDB

2. Principal Component Analysis

In [1]:
#load libraries
import numpy as np
import pandas as pd 
import csv

from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
In [2]:
#load the dataset (read_csv already returns a DataFrame)
df = pd.read_csv('datasets/dataset1.csv', low_memory=False)

#columns at positions 9 to 158 (0-based) contain the nutritional values
data = df.iloc[:, 9:159]
#treat missing values as 0
data = data.fillna(0)

Principal Component Analysis (PCA) is a statistical technique for dimensionality reduction. It transforms a large set of features into a smaller set of uncorrelated components while preserving as much of the variance (information) in the data as possible.
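
As a rough illustration of "preserving as much information as possible", the sketch below fits a two-component PCA on a small synthetic dataset and prints the share of variance each component retains. The toy data and the `toy_` names are assumptions made for this example only; they are not part of RecipeDB.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 200 samples, 5 features that all
# depend on one hidden factor plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
toy = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(200, 5))

# Standardise, then reduce the 5 features to 2 principal components.
toy_scaled = StandardScaler().fit_transform(toy)
toy_pca = PCA(n_components=2).fit(toy_scaled)
toy_components = toy_pca.transform(toy_scaled)

# Shape of the reduced data and the fraction of variance kept per component.
print(toy_components.shape)
print(toy_pca.explained_variance_ratio_)
```

The same attribute, `explained_variance_ratio_`, is available on any fitted `PCA` object, so it can be used to check how much of the nutritional variance two components actually capture.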

Here, we use PCA to detect outliers among the recipes based on their nutritional values.
The recipes are grouped into 6 continents: Europe, Asia, Africa, South (Latin) America, North America and Australia.

In [3]:
Y = df['Continent']

#standardise each feature to zero mean and unit variance
scaled_data = preprocessing.scale(data)

#project the scaled data onto the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_data)

#plot the two components, coloured by continent; hovering shows the recipe title
fig = px.scatter(components, x=0, y=1, color=df['Continent'], hover_name=df['Recipe_title'], title='PCA - Recipes')

#save the plot as a standalone HTML file
plotly.offline.plot(fig, filename='plots/pca_recipes.html')

fig.show()
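
To interpret the two axes of the plot, it can help to see which features weigh most on each component. The sketch below is one possible way to do that, not part of the original analysis: `top_loadings` is a hypothetical helper, and the `demo` data is synthetic. It relies only on the `components_` attribute of a fitted scikit-learn `PCA`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_loadings(fitted_pca, feature_names, component=0, n=5):
    """Return the n features with the largest absolute weight on one component."""
    weights = pd.Series(fitted_pca.components_[component], index=feature_names)
    order = weights.abs().sort_values(ascending=False).index
    return weights.loc[order].head(n)

# Synthetic demo (assumption): feature 'a' has far more variance than the
# others and the data is left unscaled, so 'a' should dominate component 0.
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=['a', 'b', 'c', 'd'])
demo['a'] = demo['a'] * 10
demo_pca = PCA(n_components=2).fit(demo)

print(top_loadings(demo_pca, demo.columns, component=0, n=2))
```

In this notebook the equivalent call would be `top_loadings(pca, data.columns, component=0, n=10)`, assuming `pca` and `data` are defined as in the cells above.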

The plot shows 3 prominent outliers:

I. Jessica's Mauritian Chicken Curry

In [4]:
#look up the recipe by title and select its row by index label
idx = df.index[df['Recipe_title'] == "Jessica's Mauritian Chicken Curry"]
data.loc[idx]
Out[4]:
| | Adjusted Protein (g) | Alanine (g) | Alcohol, ethyl (g) | Arginine (g) | Ash (g) | Aspartic acid (g) | Beta-sitosterol (mg) | Betaine (mg) | Caffeine (mg) | Calcium, Ca (mg) | ... | Vitamin C, total ascorbic acid (mg) | Vitamin D (D2 + D3) (g) | Vitamin D (IU) | Vitamin D2 (ergocalciferol) (g) | Vitamin D3 (cholecalciferol) (g) | Vitamin E (alpha-tocopherol) (mg) | Vitamin E, added (mg) | Vitamin K (phylloquinone) (g) | Water (g) | Zinc, Zn (mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16780 | 0.0 | 4469.5966 | 0.0 | 4659.4356 | 3416.9517 | 4874.4803 | 0.0 | 56427.9432 | 0.0 | 46447.7737 | ... | 1870.8549 | 1359.0 | 40770.0 | 0.0 | 1359.0 | 1845.7586 | 0.0 | 2873.7468 | 326987.2132 | 4458.7217 |

1 rows × 150 columns

II. Malai Seekh Kebabs For Iftar

In [5]:
#look up the recipe by title and select its row by index label
idx = df.index[df['Recipe_title'] == "Malai Seekh Kebabs For Iftar"]
data.loc[idx]
Out[5]:
| | Adjusted Protein (g) | Alanine (g) | Alcohol, ethyl (g) | Arginine (g) | Ash (g) | Aspartic acid (g) | Beta-sitosterol (mg) | Betaine (mg) | Caffeine (mg) | Calcium, Ca (mg) | ... | Vitamin C, total ascorbic acid (mg) | Vitamin D (D2 + D3) (g) | Vitamin D (IU) | Vitamin D2 (ergocalciferol) (g) | Vitamin D3 (cholecalciferol) (g) | Vitamin E (alpha-tocopherol) (mg) | Vitamin E, added (mg) | Vitamin K (phylloquinone) (g) | Water (g) | Zinc, Zn (mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35484 | 0.0 | 9820.635 | 0.0 | 10890.9515 | 7062.7184 | 15611.2014 | 0.0 | 169503.3734 | 0.0 | 50072.9691 | ... | 142.2685 | 1.0 | 41.0 | 0.0 | 1.0 | 4000.6548 | 0.0 | 1.6328 | 271198.092 | 29651.756 |

1 rows × 150 columns

III. Stifado (Traditional Greek Stew)

In [6]:
#look up the recipe by title and select its row by index label
idx = df.index[df['Recipe_title'] == "Stifado (Traditional Greek Stew)"]
data.loc[idx]
Out[6]:
| | Adjusted Protein (g) | Alanine (g) | Alcohol, ethyl (g) | Arginine (g) | Ash (g) | Aspartic acid (g) | Beta-sitosterol (mg) | Betaine (mg) | Caffeine (mg) | Calcium, Ca (mg) | ... | Vitamin C, total ascorbic acid (mg) | Vitamin D (D2 + D3) (g) | Vitamin D (IU) | Vitamin D2 (ergocalciferol) (g) | Vitamin D3 (cholecalciferol) (g) | Vitamin E (alpha-tocopherol) (mg) | Vitamin E, added (mg) | Vitamin K (phylloquinone) (g) | Water (g) | Zinc, Zn (mg) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 78586 | 0.0 | 6246.5743 | 0.0 | 7266.6385 | 4047.0674 | 9951.5176 | 0.0 | 88129.6383 | 0.0 | 49196.4542 | ... | 73.0235 | 408.0009 | 24480.0555 | 0.0 | 408.0009 | 544.0765 | 0.0 | 7280.5232 | 242133.6913 | 40190.8205 |

1 rows × 150 columns

These dishes contain far larger amounts of most nutrients than the rest of the recipes (a few nutrients, such as caffeine and ethyl alcohol, are still 0), which is why they appear as outliers.
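
The outliers above were picked out from the plot. As a possible cross-check, the sketch below ranks points by their Euclidean distance from the centre of the two-component space. This distance criterion is an assumption made for illustration, not the method used in this notebook, and `rank_by_pca_distance` and the `demo_` values are hypothetical.

```python
import numpy as np
import pandas as pd

def rank_by_pca_distance(components, titles, n=3):
    """Return the n points farthest from the centroid of the component space."""
    comps = np.asarray(components, dtype=float)
    centred = comps - comps.mean(axis=0)
    # Euclidean distance of each point from the centroid (assumed criterion)
    dist = np.sqrt((centred ** 2).sum(axis=1))
    ranked = pd.Series(dist, index=pd.Index(titles, name='Recipe_title'), name='distance')
    return ranked.sort_values(ascending=False).head(n)

# Made-up 2-D coordinates standing in for PCA output
demo_components = np.array([[0.1, -0.2], [0.0, 0.3], [25.0, 4.0], [-0.3, 0.1]])
demo_titles = ['recipe A', 'recipe B', 'recipe C', 'recipe D']

print(rank_by_pca_distance(demo_components, demo_titles, n=1))
```

With the variables from this notebook, the call would be `rank_by_pca_distance(components, df['Recipe_title'], n=3)`; whether it returns the same three recipes depends on the data.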